Nadira Noor, Sijing Liu, Yingjian Zhang
Records of Project (DSCI 550)
Sijing, Liu - 12/2021
===========================================================
Food categories (as we defined)
Healthy diet
Unhealthy diet
import numpy as np
import pandas as pd
import plotly.express as px
import seaborn as sns
sns.set_palette("Set2")
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor
from xgboost.sklearn import XGBRegressor
intake = pd.read_csv("../#DSCI550/1_for_use_Food_intake.csv")
intake.head()
| | Country | Aquatic Products, Other + Offals + Fish, Seafood | Cereals - Excluding Beer | Eggs + Milk - Excluding Butter | Fruits - Excluding Wine | Pulses | Starchy Roots | Treenuts | Vegetables + Vegetal Products | Animal fats + Animal Products + Meat | Oilcrops + Vegetable Oils | Sugar & Sweeteners + Sugar Crops | Obesity | Undernourished | Confirmed | Deaths | Confirmed (%) | Deaths (%) | Population |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | 0.2407 | 24.8097 | 7.7927 | 5.3495 | 0.2953 | 0.8802 | 0.0770 | 47.3287 | 10.8334 | 0.6045 | 1.3489 | 4.5 | 29.8 | 156210 | 7272 | 0.004 | 0.000 | 38928000 |
| 1 | Albania | 0.4450 | 5.7817 | 16.3028 | 6.7861 | 0.2380 | 1.8096 | 0.1515 | 43.0057 | 20.7886 | 1.2638 | 1.5367 | 22.3 | 6.2 | 184887 | 2916 | 0.065 | 0.001 | 2838000 |
| 2 | Algeria | 0.3286 | 13.6816 | 8.1466 | 6.3801 | 0.4783 | 4.1340 | 0.1152 | 52.0135 | 10.7921 | 1.3803 | 1.8342 | 26.6 | 3.9 | 206358 | 5918 | 0.005 | 0.000 | 44357000 |
| 3 | Angola | 1.9257 | 9.1085 | 0.8898 | 6.0005 | 0.6507 | 18.1102 | 0.0061 | 47.3763 | 7.0409 | 1.0649 | 1.8495 | 6.8 | 25.0 | 64374 | 1708 | 0.002 | 0.000 | 32522000 |
| 4 | Argentina | 0.8472 | 8.4102 | 11.2307 | 6.0435 | 0.0528 | 3.0420 | 0.0200 | 35.0062 | 26.6109 | 0.9657 | 3.0536 | 28.5 | 4.6 | 5288259 | 115942 | 0.117 | 0.003 | 45377000 |
intake.describe()
| | Aquatic Products, Other + Offals + Fish, Seafood | Cereals - Excluding Beer | Eggs + Milk - Excluding Butter | Fruits - Excluding Wine | Pulses | Starchy Roots | Treenuts | Vegetables + Vegetal Products | Animal fats + Animal Products + Meat | Oilcrops + Vegetable Oils | Sugar & Sweeteners + Sugar Crops | Obesity | Undernourished | Confirmed | Deaths | Confirmed (%) | Deaths (%) | Population |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 153.000000 | 153.000000 | 153.000000 | 153.000000 | 153.000000 | 153.000000 | 153.000000 | 153.000000 | 153.000000 | 153.000000 | 153.000000 | 153.000000 | 153.000000 | 1.530000e+02 | 153.000000 | 153.000000 | 153.000000 | 1.530000e+02 |
| mean | 1.485862 | 12.126960 | 7.292746 | 5.505375 | 0.565020 | 5.336005 | 0.123241 | 43.796318 | 15.712421 | 1.445395 | 2.867141 | 18.247059 | 11.221569 | 1.538362e+06 | 30902.006536 | 0.052412 | 0.000856 | 4.804658e+07 |
| std | 1.215787 | 5.928935 | 5.182477 | 3.142960 | 0.619181 | 5.570936 | 0.149481 | 6.676394 | 7.324163 | 1.188277 | 1.489577 | 9.417048 | 11.911941 | 5.058239e+06 | 92183.424743 | 0.049409 | 0.001041 | 1.641521e+08 |
| min | 0.199000 | 3.401400 | 0.142000 | 0.659600 | 0.001000 | 0.679600 | 0.000000 | 26.945700 | 2.321700 | 0.210000 | 0.366600 | 2.100000 | 2.000000 | 3.712000e+03 | 28.000000 | 0.000000 | 0.000000 | 7.200000e+04 |
| 25% | 0.722600 | 7.298200 | 2.600500 | 3.506100 | 0.147200 | 1.997500 | 0.023000 | 39.389000 | 10.105900 | 0.824600 | 1.774400 | 8.200000 | 2.000000 | 5.214100e+04 | 913.000000 | 0.006000 | 0.000000 | 4.020000e+06 |
| 50% | 1.157200 | 10.536500 | 6.460100 | 4.923000 | 0.330600 | 3.111300 | 0.084100 | 45.151700 | 16.014800 | 1.236900 | 2.611000 | 21.300000 | 7.000000 | 3.042410e+05 | 4675.000000 | 0.042000 | 0.001000 | 1.071600e+07 |
| 75% | 1.853700 | 16.146500 | 11.473000 | 6.786100 | 0.795500 | 5.502200 | 0.151500 | 48.492400 | 21.504400 | 1.729500 | 3.829700 | 25.700000 | 15.200000 | 9.156030e+05 | 18268.000000 | 0.087000 | 0.001000 | 3.504100e+07 |
| max | 8.804600 | 29.804500 | 21.235700 | 19.302800 | 3.483800 | 27.712800 | 0.756900 | 57.982600 | 34.367900 | 10.767000 | 9.725900 | 37.300000 | 59.600000 | 4.595319e+07 | 745668.000000 | 0.231000 | 0.006000 | 1.402385e+09 |
food_mean = intake.describe().iloc[1]
food_mean = pd.DataFrame(food_mean).drop(['Obesity', 'Undernourished', 'Confirmed', 'Deaths', 'Confirmed (%)', 'Deaths (%)', 'Population'], axis=0)
food_mean = food_mean.sort_values(by='mean', ascending=False)
food_mean_plot = food_mean.plot.pie(subplots=True, figsize=(14, 10),autopct='%1.1f%%')
As the figure shows, Vegetables + Vegetal Products (45.5%), which we categorize as healthy food, are the most consumed food group worldwide, followed by Animal fats + Animal Products + Meat (16.3%) and Cereals - Excluding Beer (12.6%).
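Each percentage in the pie chart is that food group's share of the summed column means. As a quick check, the sketch below recomputes the shares from the mean row of intake.describe() printed above. The names `food_means` and `shares` are introduced here for illustration only and are not part of the original notebook.

```python
# Column means copied from the "mean" row of intake.describe() above.
food_means = {
    "Vegetables + Vegetal Products": 43.796318,
    "Animal fats + Animal Products + Meat": 15.712421,
    "Cereals - Excluding Beer": 12.126960,
    "Eggs + Milk - Excluding Butter": 7.292746,
    "Fruits - Excluding Wine": 5.505375,
    "Starchy Roots": 5.336005,
    "Sugar & Sweeteners + Sugar Crops": 2.867141,
    "Aquatic Products, Other + Offals + Fish, Seafood": 1.485862,
    "Oilcrops + Vegetable Oils": 1.445395,
    "Pulses": 0.565020,
    "Treenuts": 0.123241,
}

# A pie chart normalizes its inputs, so each slice is value / total.
total = sum(food_means.values())
shares = {name: 100 * value / total for name, value in food_means.items()}

for name, share in shares.items():
    print(f"{name}: {share:.1f}%")
```

The three largest shares printed by this sketch should agree with the percentages quoted in the text.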
Malnutrition occurs when the body does not get enough nutrients or gets an unbalanced mix of them (WHO, 2020).
It covers two broad groups of conditions: undernutrition and obesity. Averaged across the 153 countries (the mean row of intake.describe()), the rates are 18.25% for Obesity and 11.22% for Undernourished.
fig = px.bar(intake, x = "Country", y ="Confirmed").update_xaxes(categoryorder="total descending")
fig.show()
fig = px.bar(intake, x = "Country", y ="Deaths").update_xaxes(categoryorder="total descending")
fig.show()
The United States of America has the most confirmed cases and the most deaths.
To describe the impact of COVID-19 more completely, we combine confirmed cases and deaths using the case fatality rate (CFR):
CFR = Deaths / Confirmed
We calculate the CFR of every country, as presented below (the code stores it in a column named CRF). In the following analysis, we use the CFR alongside the COVID-19 case rate.
intake['CRF'] = intake['Deaths']/intake['Confirmed']
intake['CRF']
0 0.046553
1 0.015772
2 0.028678
3 0.026532
4 0.021924
...
148 0.012022
149 0.024061
150 0.192249
151 0.017456
152 0.035170
Name: CRF, Length: 153, dtype: float64
fig = px.bar(intake, x = "Country", y ="CRF").update_xaxes(categoryorder="total descending")
fig.show()
We treat the CFR of Yemen (19.22%) as an outlier and exclude it from the following association analysis.
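These records do not state how the outlier was identified. One common convention, assumed here rather than taken from the original analysis, is Tukey's rule: flag any value above Q3 + 1.5 * IQR. The sketch below applies it to the ten values visible in the truncated output above; `cfr_sample` is a hypothetical name, and the full column would give different quartiles.

```python
import statistics

# The ten values visible in the truncated output above (rows 0-4 and 148-152).
cfr_sample = [0.046553, 0.015772, 0.028678, 0.026532, 0.021924,
              0.012022, 0.024061, 0.192249, 0.017456, 0.035170]

# Quartiles of the sample; statistics.quantiles returns [Q1, median, Q3] for n=4.
q1, _, q3 = statistics.quantiles(cfr_sample, n=4)
iqr = q3 - q1

# Tukey's upper fence: anything above it is flagged as a high outlier.
upper_fence = q3 + 1.5 * iqr
outliers = [value for value in cfr_sample if value > upper_fence]

print(f"upper fence: {upper_fence:.4f}")
print(f"flagged values: {outliers}")
```

On this small sample only the value at index 150 (Yemen's row) exceeds the fence, which is consistent with the decision to set it aside.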
corr_fig = intake[['Confirmed','Animal fats + Animal Products + Meat', 'Aquatic Products, Other + Offals + Fish, Seafood', 'Cereals - Excluding Beer', 'Eggs + Milk - Excluding Butter', 'Fruits - Excluding Wine', 'Oilcrops + Vegetable Oils', 'Pulses', 'Starchy Roots', 'Sugar & Sweeteners + Sugar Crops', 'Treenuts', 'Vegetables + Vegetal Products']]
x = corr_fig.corr(method='pearson')
plt.figure(figsize=(7,5), dpi= 80)
sns.heatmap(x[['Confirmed']].sort_values(by=['Confirmed'],ascending=False),cmap='Pastel2_r',annot=True,linewidth=0.6)
plt.title('Covid confirmed cases diets')
plt.xticks()
corr_fig = intake[['Deaths','Animal fats + Animal Products + Meat', 'Aquatic Products, Other + Offals + Fish, Seafood', 'Cereals - Excluding Beer', 'Eggs + Milk - Excluding Butter', 'Fruits - Excluding Wine', 'Oilcrops + Vegetable Oils', 'Pulses', 'Starchy Roots', 'Sugar & Sweeteners + Sugar Crops', 'Treenuts', 'Vegetables + Vegetal Products']]
x = corr_fig.corr(method='pearson')
plt.figure(figsize=(7,5), dpi= 80)
sns.heatmap(x[['Deaths']].sort_values(by=['Deaths'],ascending=False),cmap='Pastel2_r',annot=True,linewidth=0.6)
plt.title('Covid deaths cases diets')
plt.xticks()
(array([0.5]), [Text(0.5, 0, 'Deaths')])
In general, the correlations between food consumption and confirmed cases closely resemble those between food consumption and deaths.
In both cases, the top-ranked correlation is with Animal fats + Animal Products + Meat.
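For readers who want to see what corr(method='pearson') computes, here is a minimal pure-Python sketch of the Pearson coefficient. It is applied to only the five rows shown by intake.head(), so its value is illustrative and will not match the heatmap, which uses all 153 countries. The function name `pearson_r` is introduced here for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Five rows from intake.head(): Afghanistan, Albania, Algeria, Angola, Argentina.
animal_products = [10.8334, 20.7886, 10.7921, 7.0409, 26.6109]
confirmed = [156210, 184887, 206358, 64374, 5288259]

r = pearson_r(animal_products, confirmed)
print(f"Pearson r on the five-row sample: {r:.3f}")
```

Because Pearson's r measures only linear association, a single country with very large counts (Argentina in this sample) can dominate the result, which is one reason to read the heatmaps on raw counts with care.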
intake[intake.Obesity < intake['Obesity'].mean()].shape
(62, 21)
intake[intake.Obesity > intake['Obesity'].mean()].shape
(91, 21)
Taking the average obesity rate as the boundary, we divide the countries into HOC (High Obesity Countries) and LOC (Low Obesity Countries).
high_obesity = intake[intake.Obesity > intake['Obesity'].mean()]
low_obesity = intake[intake.Obesity <= intake['Obesity'].mean()]
intake['ObesityAboveAverage'] = (intake["Obesity"] > intake['Obesity'].mean()).astype(int)
intake['ObesityAboveAverage']
0 0
1 1
2 1
3 0
4 1
..
148 1
149 0
150 0
151 0
152 0
Name: ObesityAboveAverage, Length: 153, dtype: int32
We start by exploring the food group with the top-ranked correlation: Animal fats + Animal Products + Meat. Research suggests that high consumption of these foods may contribute to obesity.
fig = px.histogram(intake, x = "Animal fats + Animal Products + Meat", nbins=50, color = "ObesityAboveAverage", marginal="rug")
fig.add_shape(
type = "line",
x0 = high_obesity['Animal fats + Animal Products + Meat'].median(),
y0 = 0,
x1 = high_obesity['Animal fats + Animal Products + Meat'].median(),
y1 = 12,
line = dict(color="crimson", width=4),
)
fig.add_shape(
type = "line",
x0 = low_obesity['Animal fats + Animal Products + Meat'].median(),
y0 = 0,
x1 = low_obesity['Animal fats + Animal Products + Meat'].median(),
y1 = 12,
line = dict(color="darkblue", width=4),
)
fig.show()
fig = px.histogram(intake, x = "Vegetables + Vegetal Products", nbins=50, color = "ObesityAboveAverage", marginal="rug")
fig.add_shape(
type = "line",
x0 = high_obesity['Vegetables + Vegetal Products'].median(),
y0 = 0,
x1 = high_obesity['Vegetables + Vegetal Products'].median(),
y1 = 12,
line = dict(color="crimson", width=4),
)
fig.add_shape(
type = "line",
x0 = low_obesity['Vegetables + Vegetal Products'].median(),
y0 = 0,
x1 = low_obesity['Vegetables + Vegetal Products'].median(),
y1 = 12,
line = dict(color="darkblue", width=4),
)
fig.show()
HOC consume more Animal fats + Animal Products + Meat (part of the unhealthy diet) and fewer Vegetables + Vegetal Products (part of the healthy diet).
corr_map = intake[['Confirmed (%)', 'Deaths (%)', 'Undernourished', 'Obesity']]
x = corr_map.corr(method='pearson')
plt.figure(figsize=(10,8), dpi= 80)
sns.heatmap(x,cmap='Pastel2_r',annot=True,linewidth=0.6)
plt.title('Pearson Correlation Coefficient')
plt.xticks()
(array([0.5, 1.5, 2.5, 3.5]), [Text(0.5, 0, 'Confirmed (%)'), Text(1.5, 0, 'Deaths (%)'), Text(2.5, 0, 'Undernourished'), Text(3.5, 0, 'Obesity')])
Obesity has a stronger (positive) correlation with COVID-19 confirmed cases and deaths than undernourishment does.
fig = px.bar(intake, x = "Country", y ="Deaths", facet_col = "ObesityAboveAverage")
fig.update_xaxes(matches=None,categoryorder="total descending")
fig.show()
HOC have more COVID-19 deaths.
fig = px.scatter(intake[intake.Country != 'Yemen'], x="Deaths", y = "Obesity", size = "CRF",
hover_name='Country', log_x=False, size_max=30, template="simple_white")
fig.add_shape(
type = "line",
x0 = 0,
y0 = intake[intake.Country != 'Yemen']['Obesity'].mean(),
x1 = intake[intake.Country != 'Yemen']['Deaths'].max(),
y1 = intake[intake.Country != 'Yemen']['Obesity'].mean(),
line = dict(color="crimson", width=4),
)
fig.show()
HOC tend to have a higher CFR.
The red line represents the average obesity rate across countries. The size of each point corresponds to that country's COVID-19 CFR.
healthy_features = ['Aquatic Products, Other + Offals + Fish, Seafood', 'Cereals - Excluding Beer', 'Eggs + Milk - Excluding Butter',
'Fruits - Excluding Wine', 'Pulses', 'Starchy Roots', 'Treenuts', 'Vegetables + Vegetal Products']
unhealthy_features = ['Animal fats + Animal Products + Meat', 'Oilcrops + Vegetable Oils', 'Sugar & Sweeteners + Sugar Crops']
intake['Healthy diet'] = intake[healthy_features].sum(axis=1)
intake['Unhealthy diet'] = intake[unhealthy_features].sum(axis=1)
intake.columns
Index(['Country', 'Aquatic Products, Other + Offals + Fish, Seafood',
'Cereals - Excluding Beer', 'Eggs + Milk - Excluding Butter',
'Fruits - Excluding Wine', 'Pulses', 'Starchy Roots', 'Treenuts',
'Vegetables + Vegetal Products', 'Animal fats + Animal Products + Meat',
'Oilcrops + Vegetable Oils', 'Sugar & Sweeteners + Sugar Crops',
'Obesity', 'Undernourished', 'Confirmed', 'Deaths', 'Confirmed (%)',
'Deaths (%)', 'Population', 'Mortality', 'CRF', 'ObesityAboveAverage',
'healthy diet', 'Healthy diet', 'Unhealthy diet'],
dtype='object')
#intake_CRF = intake[intake.Country != 'Yemen'][healthy_features + ['Healthy diet'] + unhealthy_features + ['Unhealthy diet'] + ['Obesity','CRF']]
intake_CRF = intake[intake.Country != 'Yemen'][['Healthy diet','Unhealthy diet','Obesity','CRF']]
intake_CRF = shuffle(intake_CRF)
CRF_features = intake_CRF.columns.drop('CRF')
CRF_target = 'CRF'
print('Model features: ', CRF_features)
print('Model target: ', CRF_target)
X = intake_CRF[CRF_features]
y = intake_CRF[CRF_target]
Model features: Index(['Healthy diet', 'Unhealthy diet', 'Obesity'], dtype='object')
Model target: CRF
train_data, test_data = train_test_split(intake_CRF, test_size = 0.2, shuffle = True, random_state = 28)
X_train = train_data[CRF_features]
y_train = train_data[CRF_target]
X_test = test_data[CRF_features]
y_test = test_data[CRF_target]
regressor = Pipeline([
('scaler', StandardScaler()),
('estimator', Ridge(random_state=28))
])
# Training
regressor.fit(X_train, y_train)
# Scoring the training set
train_preds = regressor.predict(X_train)
regressor.score(X_train, y_train)
0.02518844722054492
# Cross validate (cv = 10)
cv_score = cross_val_score(regressor, X_train, y_train, cv = 10)
print(cv_score)
print(cv_score.mean())
[-0.01172007 0.01862246 -0.03697121 0.0447222 -0.49058435 -0.26690798 0.08883241 -0.00636709 -0.1475329 -0.05098391]
-0.08588904218468021
The cross-validation results look poor, so let's create a function that evaluates the model on several metrics (MAE, MSE, R^2).
def show_scores(model, X_train, X_test, y_train, y_test):
    # Predict on both splits, then report each metric for training and test data
    train_preds = model.predict(X_train)
    test_preds = model.predict(X_test)
    scores = {'Training MAE': mean_absolute_error(y_train, train_preds),
              'Test MAE': mean_absolute_error(y_test, test_preds),
              'Training MSE': mean_squared_error(y_train, train_preds),
              'Test MSE': mean_squared_error(y_test, test_preds),
              'Training R^2': r2_score(y_train, train_preds),
              'Test R^2': r2_score(y_test, test_preds)}
    return scores
show_scores(regressor, X_train, X_test , y_train, y_test)
{'Training MAE': 0.010151141025350294,
'Test MAE': 0.008975286401861186,
'Training MSE': 0.00018134988363696758,
'Test MSE': 0.00023418663536456003,
'Training R^2': 0.02518844722054492,
'Test R^2': 0.018481907093204297}
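The negative cross-validation scores above can look surprising if R^2 is read as a squared quantity. The sketch below writes out the three metrics by hand to show why: R^2 is 1 minus the ratio of the model's squared error to the squared error of always predicting the mean, so it drops below zero whenever the model does worse than that baseline. The data values and helper names are hypothetical and not taken from the dataset.

```python
def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical targets and two hypothetical sets of predictions.
y_true = [0.01, 0.02, 0.03, 0.04]
mean_baseline = [0.025] * 4                   # always predict the mean
worse_than_mean = [0.04, 0.03, 0.02, 0.01]    # trend in the wrong direction

print("baseline R^2:", r2(y_true, mean_baseline))      # close to 0
print("wrong-trend R^2:", r2(y_true, worse_than_mean)) # negative
print("wrong-trend MAE:", mae(y_true, worse_than_mean))
print("wrong-trend MSE:", mse(y_true, worse_than_mean))
```

Read this way, an R^2 near 0.02 means the model is barely better than predicting the average value for every country.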
Let's visualize our model's predictions against 'Unhealthy diet':
test_plot = X_test.copy()
test_plot['CRF'] = y_test
test_plot['CRF_pred'] = regressor.predict(X_test)
fig, ax = plt.subplots(figsize=[10,8])
sns.regplot(x = 'Unhealthy diet', y = 'CRF', data = test_plot, ax = ax, label='CRF')
sns.regplot(x = 'Unhealthy diet', y = 'CRF_pred', data = test_plot, ax = ax, label='CRF_pred')
plt.legend()
<matplotlib.legend.Legend at 0x1d9ca0c9400>
Our Ridge regressor does not predict well (test R^2 of about 0.02), although it roughly follows the overall trend of the target.
models = {'Ridge':Ridge(random_state=28),
'SVR':SVR(),
'RandomForest':RandomForestRegressor(),
'XGBoost':XGBRegressor(n_estimators = 1000, learning_rate = 0.05)}
# build the function that tests each model
def model_build(model, X_train, y_train, X_test, y_test, scale=True):
    if scale:
        regressor = Pipeline([
            ('scaler', StandardScaler()),
            ('estimator', model)
        ])
    else:
        regressor = Pipeline([
            ('estimator', model)
        ])
    # Training
    regressor.fit(X_train, y_train)
    # Scoring the training set
    train_preds = regressor.predict(X_train)
    print(f"R2 on single split: {regressor.score(X_train, y_train)}")
    # Cross validate (cv = 10)
    cv_score = cross_val_score(regressor, X_train, y_train, cv = 10)
    print(f"Cross validate R2 score: {cv_score.mean()}")
    # Scoring the test set
    for k, v in show_scores(regressor, X_train, X_test , y_train, y_test).items():
        print(" ", k, v)
for name, model in models.items():
    print(f"==== Scoring {name} model====")
    # Tree-based models do not need feature scaling
    if name == 'RandomForest' or name == 'XGBoost':
        model_build(model, X_train, y_train, X_test, y_test, scale=False)
    else:
        model_build(model, X_train, y_train, X_test, y_test)
    print()
==== Scoring Ridge model====
R2 on single split: 0.02518844722054492
Cross validate R2 score: -0.08588904218468021
Training MAE 0.010151141025350294
Test MAE 0.008975286401861186
Training MSE 0.00018134988363696758
Test MSE 0.00023418663536456003
Training R^2 0.02518844722054492
Test R^2 0.018481907093204297
==== Scoring SVR model====
R2 on single split: -1.8616603826053892
Cross validate R2 score: -2.8225076354249565
Training MAE 0.02090212196988749
Test MAE 0.02316032304292973
Training MSE 0.0005323713859507565
Test MSE 0.0006123915073630173
Training R^2 -1.8616603826053892
Test R^2 -1.5666423853930498
==== Scoring RandomForest model====
R2 on single split: 0.8542226279902894
Cross validate R2 score: -0.2202895874639843
Training MAE 0.003958854252978305
Test MAE 0.00979900746685763
Training MSE 2.7119815492015505e-05
Test MSE 0.00022582049008997878
Training R^2 0.8542226279902894
Test R^2 0.05354591893190397
==== Scoring XGBoost model====
R2 on single split: 0.9850731375236392
Cross validate R2 score: -0.49300521739264686
Training MAE 0.0012357668324580154
Test MAE 0.01145319701592231
Training MSE 2.776931362205015e-06
Test MSE 0.00025083605519616534
Training R^2 0.9850731375236392
Test R^2 -0.05129879057847342
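The XGBoost model fits the training set best (training R^2 of about 0.985), but its negative cross-validated and test R^2 indicate overfitting rather than genuinely better performance.
Let's try some hyperparameter tuning with a simple GridSearch: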
xgb = XGBRegressor()
parameters = {'nthread':[4],
'objective':['reg:squarederror'],
'learning_rate': [.03, 0.05, .07],
'max_depth': [5, 6, 7],
'min_child_weight': [4],
'subsample': [0.7],
'colsample_bytree': [0.7],
'n_estimators': [500, 1000]}
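The search call itself is not shown in these records. A minimal sketch of how such a grid search might be run with scikit-learn's GridSearchCV follows. To keep it quick and free of extra dependencies, it uses assumptions that are not from the original notebook: synthetic data (`X_demo`, `y_demo`), scikit-learn's GradientBoostingRegressor standing in for XGBRegressor, and a deliberately reduced grid (`param_grid`).

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (hypothetical): 120 rows, 3 features, noisy linear target.
rng = np.random.default_rng(28)
X_demo = rng.normal(size=(120, 3))
y_demo = X_demo @ np.array([0.5, -0.3, 0.1]) + rng.normal(scale=0.5, size=120)

# A small grid so the sketch runs in seconds; the grid above is larger.
param_grid = {
    "learning_rate": [0.03, 0.05, 0.07],
    "max_depth": [2, 3],
    "n_estimators": [50, 100],
}

# Every parameter combination is scored by 5-fold cross-validated R^2.
search = GridSearchCV(
    GradientBoostingRegressor(subsample=0.7, random_state=28),
    param_grid,
    scoring="r2",
    cv=5,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(search.best_score_)  # mean cross-validated R^2 of the best combination
```

Since XGBRegressor follows the scikit-learn estimator interface, the same pattern should apply to the `xgb` and `parameters` objects defined above, for example GridSearchCV(xgb, parameters, scoring="r2", cv=5) fitted on X_train and y_train. Selecting on the cross-validated score rather than the training score matters here, given how far the two diverge in the results above.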
xgb_best = XGBRegressor(colsample_bytree = 0.7,
learning_rate = 0.05,
max_depth = 6,
min_child_weight = 4,
n_estimators = 500,
nthread = 4,
objective = 'reg:squarederror',
subsample = 0.7)
model_build(xgb_best, X_train, y_train, X_test, y_test, scale=False)
R2 on single split: 0.9838036244337771
Cross validate R2 score: -0.5840435767181696
Training MAE 0.0012491862999874923
Test MAE 0.010256749833345744
Training MSE 3.0131062931090096e-06
Test MSE 0.00021171276062697787
Training R^2 0.9838036244337771
Test R^2 0.11267393747238974
def plotTest(col, target, data):
    # Overlay actual and predicted target values against one feature
    fig, ax = plt.subplots(figsize=[10,8])
    sns.regplot(x = col, y = target, data = data, ax = ax, label=target)
    sns.regplot(x = col, y = target+'_pred', data = data, ax = ax, label=target+'_pred')
    plt.legend()
plotTest('Unhealthy diet', 'CRF', test_plot)
plotTest('Healthy diet', 'CRF', test_plot)
plotTest('Obesity', 'CRF', test_plot)
The plots suggest that obesity is a better predictor of CFR than food consumption is.
intake_obesity = intake[intake.Country != 'Yemen'][['Healthy diet','Unhealthy diet','Obesity']]
intake_obesity = shuffle(intake_obesity)
obesity_features = intake_obesity.columns.drop('Obesity')
obesity_target = 'Obesity'
print('Model features: ', obesity_features)
print('Model target: ', obesity_target)
X = intake_obesity[obesity_features]
y = intake_obesity[obesity_target]
Model features: Index(['Healthy diet', 'Unhealthy diet'], dtype='object')
Model target: Obesity
train_data, test_data = train_test_split(intake_obesity, test_size = 0.2, shuffle = True, random_state = 28)
X_train = train_data[obesity_features]
y_train = train_data[obesity_target]
X_test = test_data[obesity_features]
y_test = test_data[obesity_target]
models = {'Ridge':Ridge(random_state=28),
'SVR':SVR(),
'RandomForest':RandomForestRegressor(),
'XGBoost':XGBRegressor(n_estimators = 1000, learning_rate = 0.05)}
for name, model in models.items():
    print(f"==== Scoring {name} model====")
    # Tree-based models do not need feature scaling
    if name == 'RandomForest' or name == 'XGBoost':
        model_build(model, X_train, y_train, X_test, y_test, scale=False)
    else:
        model_build(model, X_train, y_train, X_test, y_test)
    print()
==== Scoring Ridge model====
R2 on single split: 0.4036798697080629
Cross validate R2 score: 0.3265889437594536
Training MAE 5.742560785562591
Test MAE 5.727650552194383
Training MSE 52.39241620378692
Test MSE 53.19582964772561
Training R^2 0.4036798697080629
Test R^2 0.403387119537282
==== Scoring SVR model====
R2 on single split: 0.3822127620219572
Cross validate R2 score: 0.3074973556224159
Training MAE 5.417177340469058
Test MAE 5.50242170929448
Training MSE 54.27850654930544
Test MSE 52.42117693086467
Training R^2 0.3822127620219572
Test R^2 0.4120751650443313
==== Scoring RandomForest model====
R2 on single split: 0.8750314424169654
Cross validate R2 score: 0.022825717747681162
Training MAE 2.430330578512393
Test MAE 5.23261290322581
Training MSE 10.979680793388418
Test MSE 42.81732287096775
Training R^2 0.8750314424169654
Test R^2 0.5197862971417633
==== Scoring XGBoost model====
R2 on single split: 0.9999987873639742
Cross validate R2 score: -0.3465089069580724
Training MAE 0.007618268856332156
Test MAE 5.841432168406825
Training MSE 0.00010654165127486585
Test MSE 74.21287391795862
Training R^2 0.9999987873639742
Test R^2 0.1676724140065584
model = XGBRegressor(n_estimators = 1000, learning_rate = 0.05)
model.fit(X_train, y_train)
XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=1, enable_categorical=False,
gamma=0, gpu_id=-1, importance_type=None,
interaction_constraints='', learning_rate=0.05, max_delta_step=0,
max_depth=6, min_child_weight=1, missing=nan,
monotone_constraints='()', n_estimators=1000, n_jobs=4,
num_parallel_tree=1, predictor='auto', random_state=0, reg_alpha=0,
reg_lambda=1, scale_pos_weight=1, subsample=1, tree_method='exact',
validate_parameters=1, verbosity=None)
test_preds = model.predict(X_test)
test_plot = X_test.copy()
test_plot['Obesity'] = y_test
test_plot['Obesity_pred'] = test_preds
plotTest('Healthy diet', 'Obesity', test_plot)
plotTest('Unhealthy diet', 'Obesity', test_plot)
Based on the analysis above, we can summarize the main observations:
Therefore, our final conclusion is that a healthy diet could help prevent COVID-19 only indirectly, by helping people avoid obesity. In other words, people who are overweight or obese because of an imbalanced diet may be at higher risk of illness. One potential reason is that people who eat a well-balanced diet tend to be healthier and have stronger immune systems, as suggested by WHO (2021). To reduce the risk of illness, we suggest that obese people eat a healthier diet with more Vegetables + Vegetal Products and less Animal fats + Animal Products + Meat.
This was a difficult year, in which the COVID-19 pandemic made us pay closer attention to our health. The virus made it clear that not everything in the world of health is under our control. However, our research suggests that many of us are lucky enough to have a say in one important element: what we eat. Healthy diets play an important role in our overall health and immune systems (FAO, 2021). The food we put in our bodies directly affects the way that we feel and the way our bodies function. This is as true during an illness as it is before or after.